{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcc3 Solution for Exercise M5.02\n",
    "\n",
    "The aim of this exercise is to find out whether a decision tree model is able\n",
    "to extrapolate.\n",
    "\n",
    "By extrapolation, we refer to values predicted by a model outside of the range\n",
    "of feature values seen during the training.\n",
    "\n",
    "We first load the regression data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "penguins = pd.read_csv(\"../datasets/penguins_regression.csv\")\n",
    "\n",
    "feature_name = \"Flipper Length (mm)\"\n",
    "target_name = \"Body Mass (g)\"\n",
    "data_train, target_train = penguins[[feature_name]], penguins[target_name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, create two models, a linear regression model and a decision tree\n",
    "regression model, and fit them on the training data. Limit the depth at 3\n",
    "levels for the decision tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "linear_regression = LinearRegression()\n",
    "tree = DecisionTreeRegressor(max_depth=3)\n",
    "\n",
    "linear_regression.fit(data_train, target_train)\n",
    "tree.fit(data_train, target_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a synthetic dataset containing all possible flipper length from the\n",
    "minimum to the maximum of the training dataset. Get the predictions of each\n",
    "model using this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "import numpy as np\n",
    "\n",
    "data_test = pd.DataFrame(\n",
    "    np.arange(data_train[feature_name].min(), data_train[feature_name].max()),\n",
    "    columns=[feature_name],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "target_predicted_linear_regression = linear_regression.predict(data_test)\n",
    "target_predicted_tree = tree.predict(data_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a scatter plot containing the training samples and superimpose the\n",
    "predictions of both models on the top."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "plt.plot(\n",
    "    data_test[feature_name],\n",
    "    target_predicted_linear_regression,\n",
    "    label=\"Linear regression\",\n",
    ")\n",
    "plt.plot(data_test[feature_name], target_predicted_tree, label=\"Decision tree\")\n",
    "plt.legend()\n",
    "_ = plt.title(\"Prediction of linear model and a decision tree\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "The predictions that we got were within the range of feature values seen\n",
    "during training. In some sense, we observe the capabilities of our model to\n",
    "interpolate."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we check the extrapolation capabilities of each model. Create a dataset\n",
    "containing a broader range of values than your previous dataset, in other\n",
    "words, add values below and above the minimum and the maximum of the flipper\n",
    "length seen during training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "offset = 30\n",
    "data_test = pd.DataFrame(\n",
    "    np.arange(\n",
    "        data_train[feature_name].min() - offset,\n",
    "        data_train[feature_name].max() + offset,\n",
    "    ),\n",
    "    columns=[feature_name],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, make predictions with both models on this new interval of data.\n",
    "Repeat the plotting of the previous exercise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "target_predicted_linear_regression = linear_regression.predict(data_test)\n",
    "target_predicted_tree = tree.predict(data_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "plt.plot(\n",
    "    data_test[feature_name],\n",
    "    target_predicted_linear_regression,\n",
    "    label=\"Linear regression\",\n",
    ")\n",
    "plt.plot(data_test[feature_name], target_predicted_tree, label=\"Decision tree\")\n",
    "plt.legend()\n",
    "_ = plt.title(\"Prediction of linear model and a decision tree\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "The linear model extrapolates using the fitted model for flipper lengths < 175\n",
    "mm and > 235 mm. In fact, we are using the model parametrization to make these\n",
    "predictions.\n",
    "\n",
    "As mentioned, decision trees are non-parametric models and we observe that\n",
    "they cannot extrapolate. For flipper lengths below the minimum, the mass of\n",
    "the penguin in the training data with the shortest flipper length will always\n",
    "be predicted. Similarly, for flipper lengths above the maximum, the mass of\n",
    "the penguin in the training data with the longest flipper will always be\n",
    "predicted."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}